Getting Started

  • Go to the course GitHub organization page and find the repository entitled “ae07-GitHubUsername”.
  • Click the green “code” button and copy the SSH URL.
  • Go to RStudio, select File \(\rightarrow\) New Project \(\rightarrow\) Version Control \(\rightarrow\) Git and paste the URL.
  • Open the .Rmd file and replace “Your Name” with your name.

Covid-19 Community Level

The following paragraph and figure directly come from the website of Centers for Disease Control and Prevention (CDC):

COVID-19 Community Levels are a new tool to help communities decide what prevention steps to take based on the latest data. Levels can be low, medium, or high and are determined by looking at hospital beds being used, hospital admissions, and the total number of new COVID-19 cases in an area.

We will investigate how Covid-19 community levels have changed across US counties since the first data release on Feb 24, 2022. The data are weekly updated, and we use data slightly modified from the version updated on May 5, 2022. Click here if you are interested in raw data.

We first load relevant packages:

library(tidyverse)
library(sf)

Part 1: Spatial Data are Different

We load Covid-19 community level data in a csv file. This is our typical “tidy” data frame.

covid <- read_csv("data/covid.csv") 
covid
## # A tibble: 35,464 × 12
##    county  county_fips state  county_populati… health_service_… health_service_…
##    <chr>   <chr>       <chr>             <dbl>            <dbl> <chr>           
##  1 Americ… 60000       Ameri…            47392              901 American Samoa  
##  2 Guam    66000       Guam             168489              902 Guam            
##  3 Common… 69000       Commo…            51851              903 Commonwealth of…
##  4 United… 78000       Unite…           106290              905 United States V…
##  5 Americ… 60000       Ameri…            47392              901 American Samoa  
##  6 Guam    66000       Guam             168489              902 Guam            
##  7 Common… 69000       Commo…            51851              903 Commonwealth of…
##  8 United… 78000       Unite…           106290              905 United States V…
##  9 Americ… 60000       Ameri…            47392              901 American Samoa  
## 10 Guam    66000       Guam             168489              902 Guam            
## # … with 35,454 more rows, and 6 more variables:
## #   health_service_area_population <dbl>,
## #   covid_inpatient_bed_utilization <dbl>,
## #   covid_hospital_admissions_per_100k <dbl>, covid_cases_per_100k <dbl>,
## #   covid_community_level <chr>, date <date>

We use filter and select to focus on the contiguous US and the following variables:

variable name description
state State name
county County name
county_fips Federal Information Processing Standards (FIPS) five character county code
county_population County population (2019 Census estimate)
covid_community_level Covid-19 community level
date Date of data release
covid <- covid %>% 
  filter(!(state %in% c("United States Virgin Islands", 
                        "Commonwealth of the Northern Mariana Islands", 
                        "American Samoa",
                        "Puerto Rico", 
                        "Alaska", 
                        "Hawaii", 
                        "Guam"))) %>% 
  select(state, county, county_fips, county_population, 
         covid_community_level, date)

In order to draw maps based on Covid-19 community level, we need another data set that has information on geometry of US counties. We read the shapefile us_counties.shp using st_read(). Its original shapefile (.shp) is downloadable from the US Census Bureau under “cb_2018_us_county_20m.zip” (resolution level 1:20,000,000).

The loaded data frame has the following variables:

variable name description
statefips State FIPS
countyfips County FIPS
county County name
geometry geometry features

This is an sf object.

us_counties <- st_read("data/us_counties.shp", quiet = TRUE)
us_counties
## Simple feature collection with 3108 features and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##    statefips countyfips   county                       geometry
## 1         37        017   Bladen MULTIPOLYGON (((-78.902 34....
## 2         37        167   Stanly MULTIPOLYGON (((-80.49737 3...
## 3         39        153   Summit MULTIPOLYGON (((-81.68699 4...
## 4         42        113 Sullivan MULTIPOLYGON (((-76.81373 4...
## 5         48        459   Upshur MULTIPOLYGON (((-95.15274 3...
## 6         48        049    Brown MULTIPOLYGON (((-99.19587 3...
## 7         45        021 Cherokee MULTIPOLYGON (((-81.87441 3...
## 8         01        043  Cullman MULTIPOLYGON (((-87.11199 3...
## 9         54        023    Grant MULTIPOLYGON (((-79.48687 3...
## 10        46        055   Haakon MULTIPOLYGON (((-102.0011 4...

Q - What differences do you observe when comparing a typical tidy data frame to the new simple feature object?

Part 2: sf and dplyr

The sf package plays nicely with our earlier data wrangling functions from dplyr.

filter()

Compare county FIPS from two data sets by filtering for Durham county in NC. The state FIPS for NC is 37. Check variable types before filtering.

us_counties %>% 
  filter(county == "Durham", statefips == "37") 
## Simple feature collection with 1 feature and 3 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -79.0163 ymin: 35.86321 xmax: -78.69932 ymax: 36.23932
## Geodetic CRS:  NAD83
##   statefips countyfips county                       geometry
## 1        37        063 Durham MULTIPOLYGON (((-79.0095 35...
covid %>% 
  filter(county == "Durham County", state == "North Carolina")
## # A tibble: 11 × 6
##    state     county    county_fips county_populati… covid_community_… date      
##    <chr>     <chr>     <chr>                  <dbl> <chr>             <date>    
##  1 North Ca… Durham C… 37063                 321488 High              2022-02-24
##  2 North Ca… Durham C… 37063                 321488 Medium            2022-03-03
##  3 North Ca… Durham C… 37063                 321488 Low               2022-03-10
##  4 North Ca… Durham C… 37063                 321488 Low               2022-03-24
##  5 North Ca… Durham C… 37063                 321488 Low               2022-03-17
##  6 North Ca… Durham C… 37063                 321488 Low               2022-03-31
##  7 North Ca… Durham C… 37063                 321488 Low               2022-04-07
##  8 North Ca… Durham C… 37063                 321488 Low               2022-04-14
##  9 North Ca… Durham C… 37063                 321488 Low               2022-04-21
## 10 North Ca… Durham C… 37063                 321488 Low               2022-04-28
## 11 North Ca… Durham C… 37063                 321488 Low               2022-05-05

Q - How are they different?

mutate()

In us_counties data, we will create a new variable county_fips that exactly matches county_fips from covid data. Here we use paste0() that concatenates strings.

us_counties <- us_counties %>% 
  mutate(county_fips = paste0(statefips, countyfips))
us_counties$county_fips %>% 
  head(5)
## [1] "37017" "37167" "39153" "42113" "48459"

select()

Select county and county_fips from us_counties.

us_counties %>% 
  select(county, county_fips)
## Simple feature collection with 3108 features and 2 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## First 10 features:
##      county county_fips                       geometry
## 1    Bladen       37017 MULTIPOLYGON (((-78.902 34....
## 2    Stanly       37167 MULTIPOLYGON (((-80.49737 3...
## 3    Summit       39153 MULTIPOLYGON (((-81.68699 4...
## 4  Sullivan       42113 MULTIPOLYGON (((-76.81373 4...
## 5    Upshur       48459 MULTIPOLYGON (((-95.15274 3...
## 6     Brown       48049 MULTIPOLYGON (((-99.19587 3...
## 7  Cherokee       45021 MULTIPOLYGON (((-81.87441 3...
## 8   Cullman       01043 MULTIPOLYGON (((-87.11199 3...
## 9     Grant       54023 MULTIPOLYGON (((-79.48687 3...
## 10   Haakon       46055 MULTIPOLYGON (((-102.0011 4...

Notice that geometries are “sticky”. They are kept until deliberately dropped using st_drop_geometry. Manipulating spatial data objects is similar, but not identical to manipulating data frames.

us_counties %>% 
  select(county, county_fips) %>% 
  st_drop_geometry() %>% 
  slice(1:5)
##     county county_fips
## 1   Bladen       37017
## 2   Stanly       37167
## 3   Summit       39153
## 4 Sullivan       42113
## 5   Upshur       48459

something_join()

Create a new sf object covid_sf by joining covid and us_counties keeping all rows in covid. Note that two variable names are common, namely, county_fips and county, in both data frames.

Q - Which variable should we use to join the data frames by? Can we use any? Why or why not?

covid_sf <- covid %>% 
  left_join(us_counties, by = "county_fips") %>% # tbl_df/ tbl/ data.frame, no sf!
  st_as_sf()
covid_sf
## Simple feature collection with 34188 features and 9 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
## Geodetic CRS:  NAD83
## # A tibble: 34,188 × 10
##    state   county.x    county_fips county_populati… covid_community_… date      
##    <chr>   <chr>       <chr>                  <dbl> <chr>             <date>    
##  1 Alabama Geneva Cou… 01061                  26271 High              2022-02-24
##  2 Alabama Greene Cou… 01063                   8111 Medium            2022-02-24
##  3 Alabama Hale County 01065                  14651 Medium            2022-02-24
##  4 Alabama Henry Coun… 01067                  17205 High              2022-02-24
##  5 Alabama Houston Co… 01069                 105882 High              2022-02-24
##  6 Alabama Jackson Co… 01071                  51626 High              2022-02-24
##  7 Alabama Jefferson … 01073                 658573 High              2022-02-24
##  8 Alabama Lamar Coun… 01075                  13805 Medium            2022-02-24
##  9 Alabama Lauderdale… 01077                  92729 Medium            2022-02-24
## 10 Alabama Lawrence C… 01079                  32924 Low               2022-02-24
## # … with 34,178 more rows, and 4 more variables: statefips <chr>,
## #   countyfips <chr>, county.y <chr>, geometry <MULTIPOLYGON [°]>

Q - Check what happens to the common variable that is not used in joining.

group_by(), summarize()

Using covid, let’s compute the total population across counties in different Covid-19 community levels based on the latest data (date == "2022-05-05").

covid %>% 
  filter(date == "2022-05-05") %>% 
  group_by(covid_community_level) %>%
  summarize(tot_pop = sum(county_population))
## # A tibble: 3 × 2
##   covid_community_level   tot_pop
##   <chr>                     <dbl>
## 1 High                   13272001
## 2 Low                   251130313
## 3 Medium                 61689792

Part 3: sf and ggplot

We will examine how the Covid-19 community levels change over space and over time. As usual, we can build up a visualization layer-by-layer beginning with ggplot. Let’s start by making a basic plot of all US counties.

covid_sf %>% 
  ggplot() + 
  geom_sf() + 
  labs(title = "US counties")

Let’s try NC.

covid_sf %>% 
  filter(state == "North Carolina") %>% 
  ggplot() + 
  geom_sf() + 
  labs(title = "NC counties")

Now adjust the theme with theme_bw().

covid_sf %>% 
  filter(state == "North Carolina") %>% 
  ggplot() + 
  geom_sf() + 
  labs(title = "NC counties") +
  theme_bw() # white canvas

If you liked the theme, we can globally fix it as a default theme instead of attaching that layer for every ggplot.

theme_set(theme_bw())

We now fill each county by the latest value of Covid-19 community level (date == "2022-05-05"). To match colors used in CDC, we use scale_fill_manual().

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level)) + 
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Now adjust color in geom_sf to change the color of the county borders.

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          color = NA) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Adjust the width of the county borders using size.

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          size = 0) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Finally, adjust the transparency using alpha by mapping

covid_sf %>% 
  filter(date == "2022-05-05") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level), # mapping
          alpha = 1) + # setting
  labs(title = "Covid-19 community level across US counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Q - What is your final choice of color, size, and alpha?

Q - Create the same plot for NC counties only. Different choices of color, size, and alpha might be more appealing in this case.

covid_sf %>% 
  filter(date == "2022-05-05", 
         state == "North Carolina") %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level)) + 
  labs(title = "Covid-19 community level across NC counties as of May 5, 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99"))

Q - What is the county that has a different community level than the rest of the counties in NC?

We can visualize temporal trends of Covid-19 community levels by faceting.

Q - Create a faceted plot of US counties over all available dates where each county is filled with Covid-19 community level. Briefly explain the spatial and temporal trends.

Q - Repeat the previous exercise for states of your choosing.

covid_sf %>% 
  filter(state %in% c("New York", "Vermont", "New Hampshire", 
                      "Maine", "Massachusetts", "Connecticut", 
                      "Rhode Island", "New Jersey")) %>% 
  ggplot() + 
  geom_sf(aes(fill = covid_community_level)) + 
  facet_wrap(~ date) + 
  labs(title = "Covid-19 community level in Northeastern US since Feb 2022", 
       fill = "Covid-19 community level") +
  scale_fill_manual(values = c("High" = "#fb8b5b",
                               "Medium" = "#f9f99d", 
                               "Low" = "#00cc99")) + 
  theme(legend.position = "bottom")

Practice

  1. Write a brief research question that you could answer with this dataset and then investigate it here.

  2. What are limitations of your visualizations above?

Submitting Application Exercises

  • Once you have completed the activity, push your final changes to your GitHub repo.
  • Make sure you committed at least three times.
  • Check that your repo is updated on GitHub, and that’s all you need to do to submit application exercises for participation.